NSF PAR Search | NSF Public Access Repository

ACROBAT: Optimizing Auto-batching of Dynamic Deep Learning at Compile Time

Fegade, Pratik; Chen, Tianqi; Gibbons, Phillip; Mowry, Todd (May 2024, Proceedings of Machine Learning and Systems 2024)

Full Text Available
The Key to Effective UDF Optimization: Before Inlining, First Perform Outlining

https://doi.org/10.14778/3696435.3696436

Arch, Samuel; Liu, Yuchen; Mowry, Todd C; Patel, Jignesh M; Pavlo, Andrew (September 2024, Proceedings of the VLDB Endowment)

Although user-defined functions (UDFs) are a popular way to augment SQL's declarative approach with procedural code, the mismatch between programming paradigms creates a fundamental optimization challenge. UDF inlining automatically removes all UDF calls by replacing them with equivalent SQL subqueries. Although inlining leaves queries entirely in SQL (resulting in large performance gains), we observe that inlining the entire UDF often leads to sub-optimal performance. A better approach is to analyze the UDF, deconstruct it into smaller pieces, and inline only the pieces that help query optimization. To achieve this, we propose UDF outlining, a technique to intentionally hide pieces of a UDF from the optimizer, resulting in simpler UDFs and significantly faster query plans. Our implementation (PRISM) demonstrates that UDF outlining improves performance over conventional inlining (on average 1.29× speedup for DuckDB and 298.73× for SQL Server) through a combination of more effective unnesting, improved data skipping, and the avoidance of unnecessary joins.
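To make the outlining idea concrete, here is a toy sketch using Python's standard `sqlite3` module; it is not PRISM and performs no automatic analysis. A hypothetical `monolithic_udf` buries a selective filter (`category = 'book'`) inside procedural code, so the engine can only call it row by row. In the hand-outlined query, that filter is inlined as plain SQL the optimizer can see, while the remaining procedural discount logic stays hidden in a smaller UDF, `apply_discount`. The table, columns, and discount rules are assumptions made for illustration.

```python
import sqlite3


def apply_discount(price: float) -> float:
    """Procedural logic kept hidden inside a UDF (the 'outlined' piece)."""
    for threshold, rate in ((100.0, 0.10), (50.0, 0.05)):
        if price >= threshold:
            return round(price * (1.0 - rate), 2)
    return round(price, 2)


def monolithic_udf(category: str, price: float):
    """Whole UDF: the filter is buried inside opaque procedural code."""
    if category != "book":
        return None  # becomes SQL NULL
    return apply_discount(price)


def build_db() -> sqlite3.Connection:
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE items (id INTEGER, category TEXT, price REAL)")
    conn.executemany(
        "INSERT INTO items VALUES (?, ?, ?)",
        [(1, "book", 120.0), (2, "toy", 80.0), (3, "book", 60.0), (4, "book", 20.0)],
    )
    conn.create_function("monolithic_udf", 2, monolithic_udf)
    conn.create_function("apply_discount", 1, apply_discount)
    return conn


def run_monolithic(conn: sqlite3.Connection):
    # The optimizer sees only an opaque call, so every row must invoke the UDF.
    sql = ("SELECT id, monolithic_udf(category, price) AS p FROM items "
           "WHERE monolithic_udf(category, price) IS NOT NULL ORDER BY id")
    return conn.execute(sql).fetchall()


def run_outlined(conn: sqlite3.Connection):
    # The filter is inlined as plain SQL; only the arithmetic stays in a UDF.
    sql = ("SELECT id, apply_discount(price) AS p FROM items "
           "WHERE category = 'book' ORDER BY id")
    return conn.execute(sql).fetchall()


if __name__ == "__main__":
    conn = build_db()
    print(run_monolithic(conn))
    print(run_outlined(conn))
```

Both queries return the same rows here; in a real engine the benefit would come from the optimizer being able to use the exposed predicate (for example, for data skipping or unnesting), which a table this small cannot show.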
Full Text Available
Memento: Architectural Support for Ephemeral Memory Management in Serverless Environments

https://doi.org/10.1145/3613424.3623795

Wang, Ziqi; Zhao, Kaiyang; Li, Pei; Jacob, Andrew; Kozuch, Michael; Mowry, Todd; Skarlatos, Dimitrios (October 2023, ACM)

Serverless computing is an increasingly attractive paradigm in the cloud due to its ease of use and fine-grained pay-for-what-you-use billing. However, serverless computing poses new challenges to system design due to its short-lived function execution model. Our detailed analysis reveals that memory management accounts for a large fraction of function execution cycles. This is because functions pay the full critical-path costs of memory management in both userspace and the operating system without the opportunity to amortize these costs over their short lifetimes. To address this problem, we propose Memento, a new hardware-centric memory management design based on our insight that memory allocations in serverless functions are typically small and are either freed quickly after allocation or freed when the function exits. Memento alleviates the overheads of serverless memory management by introducing two key mechanisms: (i) a hardware object allocator that performs in-cache memory allocation and free operations based on arenas, and (ii) a hardware page allocator that manages a small pool of physical pages used to replenish the arenas of the object allocator. Together, these mechanisms reduce memory management overheads and bypass costly userspace and kernel operations. Memento integrates naturally with existing software stacks through a set of ISA extensions that enable seamless integration with multiple language runtimes. In addition, Memento leverages the newly exposed memory allocation semantics in hardware to introduce a main memory bypass mechanism that avoids unnecessary DRAM accesses for newly allocated objects. We evaluate Memento with full-system simulations across a diverse set of containerized serverless workloads and language runtimes. The results show that Memento achieves function execution speedups ranging from 8% to 28%, with an average of 16%.
Furthermore, Memento's hardware allocators and main memory bypass mechanism reduce main memory traffic by 30% on average. The combined effects of Memento reduce the pricing cost of function execution by 29%. Finally, we demonstrate the applicability of Memento beyond functions, to major serverless platform operations and long-running data processing applications.
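The arena-plus-page-pool structure described above can be modeled in software to make the mechanism concrete. The sketch below is a Python analogy only, not Memento's hardware design or ISA. A hypothetical `PagePool` hands out fixed-size pages. A hypothetical `ArenaAllocator` bump-allocates small objects inside its current page and refills from the pool when that page is exhausted. Its `release_all` method returns every page at once, mirroring the observation that many serverless allocations die when the function exits. Page size, pool size, and all names are assumptions; per-object free, cache residency, and the main memory bypass are not modeled.

```python
class PagePool:
    """Hypothetical fixed pool of page numbers used to replenish arenas."""

    def __init__(self, num_pages: int, page_size: int = 4096):
        self.page_size = page_size
        self.free_pages = list(range(num_pages))

    def take(self) -> int:
        if not self.free_pages:
            raise MemoryError("page pool exhausted")
        return self.free_pages.pop()

    def give_back(self, page: int) -> None:
        self.free_pages.append(page)


class ArenaAllocator:
    """Bump allocator for small objects; its pages come from a PagePool."""

    def __init__(self, pool: PagePool):
        self.pool = pool
        self.pages = []               # page numbers owned by this arena
        self.offset = pool.page_size  # forces a refill on the first allocation

    def alloc(self, size: int) -> int:
        """Return a simulated address for `size` bytes."""
        if size <= 0 or size > self.pool.page_size:
            raise ValueError("only small objects (at most one page) are handled")
        if self.offset + size > self.pool.page_size:
            self.pages.append(self.pool.take())  # replenish the arena
            self.offset = 0
        addr = self.pages[-1] * self.pool.page_size + self.offset
        self.offset += size
        return addr

    def release_all(self) -> None:
        """Free everything at once, e.g. when the function exits."""
        for page in self.pages:
            self.pool.give_back(page)
        self.pages.clear()
        self.offset = self.pool.page_size
```

Allocation is a bounds check plus a pointer bump, and teardown is proportional to the number of pages rather than the number of objects, which is the cost profile that makes arenas attractive for short-lived functions.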
Full Text Available
ED-Batch: Efficient Automatic Batching of Dynamic Neural Networks via Learned Finite State Machines

Chen, Siyuan; Fegade, Pratik; Chen, Tianqi; Gibbons, Phillip; Mowry, Todd (January 2023, Proceedings of Machine Learning Research)

Batching has a fundamental influence on the efficiency of deep neural network (DNN) execution. However, for dynamic DNNs, efficient batching is particularly challenging because the dataflow graph varies per input instance. As a result, state-of-the-art frameworks use heuristics that lead to suboptimal batching decisions. Further, batching imposes strict memory-adjacency requirements and can lead to high data movement costs. In this paper, we provide an approach for batching dynamic DNNs based on finite state machines, which enables the automatic discovery of batching policies specialized for each DNN via reinforcement learning. Moreover, we find that memory planning that is aware of the batching policy can save significant data movement overhead; we automate this planning with a PQ tree-based algorithm that we introduce. Experimental results show that our framework achieves average speedups of 1.15×, 1.39×, and 2.45× over state-of-the-art frameworks for chain-based, tree-based, and lattice-based DNNs, respectively, across CPU and GPU. The framework is open-sourced at https://github.com/gulang2019/ED-Batch.git.
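The idea of treating batching as a policy over the current frontier of ready operations can be sketched in a few lines of Python. In this simplified model (an assumption, not ED-Batch's actual state encoding), the state is the set of operation types that are ready, and a policy table maps each state to the type to batch next. A hand-written table stands in for the FSM that ED-Batch learns with reinforcement learning, and states missing from the table fall back to a greedy "largest ready group" heuristic. The function name and operation names are hypothetical, and the graph is assumed to be acyclic.

```python
from collections import defaultdict


def batch_schedule(ops, deps, policy):
    """Execute a dynamic dataflow graph as a sequence of same-type batches.

    ops:    node -> op type (e.g. "embed", "cell")
    deps:   node -> list of nodes it depends on
    policy: state -> op type to batch next, where the state is the frozenset
            of op types currently ready (a stand-in for a learned FSM)
    """
    remaining, done, batches = set(ops), set(), []
    while remaining:
        ready = [n for n in remaining if all(d in done for d in deps.get(n, []))]
        if not ready:
            raise ValueError("dependency cycle: no node is ready")
        by_type = defaultdict(list)
        for n in ready:
            by_type[ops[n]].append(n)
        choice = policy.get(frozenset(by_type))
        if choice not in by_type:
            # Fallback heuristic: batch the type with the most ready nodes
            # (ties broken by type name so the result is deterministic).
            choice = max(sorted(by_type), key=lambda t: len(by_type[t]))
        batch = sorted(by_type[choice])
        batches.append((choice, batch))
        done.update(batch)
        remaining.difference_update(batch)
    return batches
```

For instance, with `ops = {"e1": "embed", "c1": "cell", "c2": "cell"}` and `deps = {"c2": ["e1"]}`, the greedy fallback alone produces three batches, whereas a policy entry that batches `embed` first when both types are ready lets both `cell` nodes run together, for two batches in total.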
Full Text Available
RipTide: A Programmable, Energy-Minimal Dataflow Compiler and Architecture

https://doi.org/10.1109/MICRO56248.2022.00046

Gobieski, Graham; Ghosh, Souradip; Heule, Marijn; Mowry, Todd; Nowatzki, Tony; Beckmann, Nathan; Lucia, Brandon (October 2022, 2022 55th IEEE/ACM International Symposium on Microarchitecture (MICRO))

Emerging sensing applications create an unprecedented need for energy efficiency in programmable processors. To achieve useful multi-year deployments on a small battery or energy harvester, these applications must avoid off-device communication and instead process most data locally. Recent work has shown coarse-grained reconfigurable arrays (CGRAs) to be a promising architecture for this domain. Unfortunately, nearly all prior CGRAs support only computations with simple control flow and no memory aliasing (e.g., affine inner loops), causing an Amdahl efficiency bottleneck as non-trivial fractions of programs must run on an inefficient von Neumann core.
RipTide is a co-designed compiler and CGRA architecture that achieves both high programmability and extreme energy efficiency, eliminating this bottleneck. RipTide provides a rich set of control-flow operators that support arbitrary control flow and memory access on the CGRA fabric. RipTide implements these primitives without tagged tokens to save energy; this requires careful ordering analysis in the compiler to guarantee correctness. RipTide further saves energy and area by offloading most control operations into its programmable on-chip network, where they can reuse existing network switches. RipTide’s compiler is implemented in LLVM, and its hardware is synthesized in Intel 22FFL. RipTide compiles applications written in C while using 25% less energy than the state-of-the-art energy-minimal CGRA and 6.6× less energy than a von Neumann core.
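Mapping an `if` inside a loop onto a dataflow fabric replaces the branch with a gate on data tokens. The sketch below models a generic steer-style operator in Python to illustrate that idea. It is an assumption that this resembles RipTide's steer semantics, and it omits the other control-flow operators, the compiler's ordering analysis, and the on-chip network. The class and function names are hypothetical. Because tokens are matched by FIFO order rather than by tags, correctness depends on each queue receiving tokens in program order, which is why a tagless design needs careful ordering analysis in the compiler.

```python
from collections import deque


class Steer:
    """Hypothetical steer-style gate modeled on a generic dataflow primitive.

    Each firing consumes one decider token and one data token. The data token
    is forwarded only when the decider equals `polarity`; otherwise it is
    dropped. Tokens are matched purely by FIFO order, with no tags.
    """

    def __init__(self, polarity: bool):
        self.polarity = polarity
        self.decider = deque()
        self.data = deque()
        self.out = deque()

    def fire(self) -> bool:
        # Dataflow firing rule: fire only when both inputs hold a token.
        if not self.decider or not self.data:
            return False
        d = self.decider.popleft()
        v = self.data.popleft()
        if d == self.polarity:
            self.out.append(v)
        return True


def sum_above(values, threshold):
    """Dataflow version of: for v in values: if v > threshold: total += v."""
    gate = Steer(polarity=True)
    for v in values:
        gate.decider.append(v > threshold)  # token from a comparison operator
        gate.data.append(v)                 # the same value, fanned out
    while gate.fire():
        pass
    total = 0
    while gate.out:  # an adder consumes forwarded tokens in arrival order
        total += gate.out.popleft()
    return total
```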
Full Text Available
Cortex: A Compiler for Recursive Deep Learning Models

Fegade, Pratik; Chen, Tianqi; Gibbons, Phillip B; Mowry, Todd C (April 2021, Proceedings of the 4th Conference on Machine Learning and Systems, MLSys'21)

Full Text Available
Filter Representation in Vectorized Query Execution

https://doi.org/10.1145/3465998.3466009

Ngom, Amadou; Menon, Prashanth; Butrovich, Matthew; Ma, Lin; Lim, Wan Shen; Mowry, Todd C.; Pavlo, Andrew (June 2021, DAMON)

Full Text Available
Permutable compiled queries: dynamically adapting compiled queries without recompiling

https://doi.org/10.14778/3425879.3425882

Menon, Prashanth; Ngom, Amadou; Ma, Lin; Mowry, Todd C.; Pavlo, Andrew (October 2020, Proceedings of the VLDB Endowment)
Full Text Available
Relaxed operator fusion for in-memory databases: making compilation, vectorization, and prefetching work together at last

https://doi.org/10.14778/3151113.3151114

Menon, Prashanth; Mowry, Todd C.; Pavlo, Andrew (September 2017, Proceedings of the VLDB Endowment)

Full Text Available
